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ABSTRACT 

The development of fast and accurate methods of photometric redshift estimation is a 
vital step towards being able to fully utilize the data of next-generation surveys within 
precision cosm ology. In this paper we ap ply a specific approach to spectral connectivity 
analysis (SCA; iLee fc Wassermaiill2009[ ) called diffusion map. SCA is a class of non- 
linear techniques for transforming observed data (e.g., photometric colours for each 
galaxy, where the data lie on a complex subset of p-dimensional space) to a simpler, 
more natural coordinate system wherein we apply regression to make redshift predic- 
tions. In previous application s of SCA to other astronomical problems (|Richards et al.l 
l2009al [Richards et al.ll2009H) . we demonstrate its superiority vis-a-vis Principal Com- 
ponents Analysis (PCA), a standard linear technique for transforming data. As SCA 
relies upon eigen-decomposition, our training set size is limited to < 10* galaxies; we 
use the Nystrom extension to quickly estimate diffusion coordinates for objects not in 
the training set. We apply our method to 350,738 SDSS main sample galaxies, 29,816 
SDSS luminous red galaxies, and 5,223 galaxies from DEEP2 with CFHTLS ugriz pho- 
tometry. For all three datasets, we achieve prediction accuracies on par with previous 
analyses, and find that use of the Nystrom extension leads to a negligible loss of pre- 
diction accura cy relative to that achieved with the train ing sets. As in some previous 
analyses fe.g.. ICollister fc Lahav|[200llBall et al.ll2008D . we observe that our predic- 
tions are generally too high (low) in the low (high) redshift regimes. We demonstrate 
that this is a manifestation of attenuation bias, wherein measurement error (i.e., uncer- 
tainty in diffusion coordinates due to uncertainty in the measured fluxes/magnitudes) 
reduces the slope of the best-fit regression line. Mitigation of this bias is necessary 
if we are to use photometric redshift estimates produced by computationally efficient 
empirical methods in precision cosmology. 

Key words: galaxies: distances and redshifts - galaxies: fundamental parameters - 
galaxies: statistics - methods: statistical - methods: data analysis 



1 INTRODUCTION 

The accurate estimation of redshifts from photometric data 
is a key component to fulfilling the promise of next- 
generation cosmological surveys. For instance, photometry 
to i? ~ 30 is expected for billions of galaxies from the Large 
Synoptic Survey Telescope (LSST; Jvezic et al., 2008) alone; 
compare this to, e.g., the ~ 10^ spectra collecte d to a depth 
R ^ 24 by the DEEP2 G alaxy Redshift Survey (iDavis et al.1 
I2OO3I . jPavis et al.ll2007l ). It is clear that redshift-dependent 
analyses of galaxies that aim to undercover signatures of, 
e.g., weak lensing or baryon acoustic oscillations in imaging 
data will require the use of photometric redshifts. 
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Redshifts estimated via, e.g., ugriz photometry will 
necessarily lack the precision of those that are spectro- 
scopically derived due to noise, outliers, weak spectral fea- 
tures, and incomplete spectral energy distribution (SED) 
templates. For this reason, one major goal of photometric 
redshift estimation is to generate accurate ensembles of es- 
timates, i.e., to have the mean redshift within a photometric 
reds hift bin be an accura t e estimator of true redshi ft (see, 
e.g., lAlbrecht etal] l200d iMa. Hu. fc Hutere j [ioo^ ) . Such 
ensembles are typically generated via one of two methods: 
either template fitting, wherein redshifted SED templates 
are generally compared to a given vector of magnitudes 
with the goal of minim izing the statistic or maximiz- 
ing t h e likelihood (e . g.. iFernandez-Sqto. Lanzetta. fc Yahill 
1 19991 . iBenitej I2OO0I . iFeldmann et al.l |2006| ). or empirical 
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methods, wherein one uses photometry from a small collec- 
tion of objects with spectroscopically confirmed redshifts to 
train a model relating photometric colours to redsh ifts (e.g., 
Connolly et al. 1995. Vanzclla ct al. 2004. CoUister fc Lahayl 



20041. iBudavari et al.lf2005l . lBall et al.ll2007l . iBall et al.ll2008l . 



Ovaizu et al .1120081 ). Some combine th e two approaches (e.g 



Ilbert et all |2006| . Illbert et al.ll2008l ). while others propose 



folding in information b eyond photometr ic colours (e.g., 
CoUister fc Lahavl |2004| . iBall et all |2004 IWray fc GunnI 



20081 . lNewmanll2008l ) 



In this paper we propose a new empirical method for 
photometric redshift estimation based on t he diffusion map 
ICoifman fc Lafonll2006l . iLafon fc Lej|2006 'l , which is an ap- 
proach to spectral connectivity analysis (SCA). SCA is a 
suite of established non-linear eigen-technique^ that cap- 
ture the underlying geometry of data by propagating local 
neighborhood information through a Markov process. SCA 
thus allows one to find a natural coordinate system for data 
such as photometric colours whose original parametriza- 
tion is not am e nable t o av ailable statistica l techn iques. In 
[Richards et~aLl (|2009al ') and lRichards et all (12009^ 1. we ap- 
ply the diffusion map t o two d ifferent astronomical prob- 
lems. In [Richards et al.l (|2009al ). we develop a framework 
combining diffusion map and adaptive linear regression and 
apply it to SDSS spectroscopic data, demonstrating how it 
may be used to reduce the dimensionality of the data space 
and to predict, e.g., redshifts in a computationally efficient 
manner. We also demonstrate the superiority of the diffusion 
map to principal components analy sis, a related, much mor e 
commonly used linear technique. In lRichards et ahl (l2009bl ). 
we utilize the diffusion map and the K-means clustering al- 
gorithm to determine optimal bases of simple stellar pop- 
ulation spectra that we use to estimate the star-formation 
histories of galaxies. 

In !j2l we review the basics of our diffusion map and 
regression framework, and introduce a new com ponent: the 
appli cation of the Nystrom extension (see, e.g.. IPress et al.l 
1 19921 ). a computationally efficient and accurate technique 
for estimating diffusion coordinates for new objects given 
those of the training set. In ^ we apply our framework 
to Sloan Digital Sky Survey data, specifically main sam- 
ple galaxies (MSGs) and luminous red galaxies (LRGs) , and 
demonstrate that we achieve accuracy on par with that of 
more computationally intensive techniques. We also apply 
our framework to data from the DEEP2 Galaxy Redshift 
Survey that is matched to ugriz photometry of the C anada- 
Fran ce-Hawaii Telescope Legacy Survey (CFHTLS; ICwynl 
l2008l 'l and demonstate that it provides accurate estimation 
of redshifts to 2 ~ 0.75 given four colours alone. We demon- 
strate that the bivariate distributions of photometric and 
spectroscopic redshifts for SDSS and DEEP2 are affected by 
attenuation bias, the tendency of measurement error in the 
predictor to reduce the slope of linear models. Last, in SJJI we 
summarize our results and discuss how we can extend our 
framework to the high redshift regime where spectroscopic 
coverage will be incomplete. 



^ The name SCA is applied to these eigen-techniques by 
iLee fc WassermanI ll2009h . who study their statistical properties. 



2 ALGORITHM 
2.1 Diffusion map 

In this section we review the basics of diffusion map con- 
struction, an approach to SCA . For more details, we refe r 
the reader to'Coifman fc Lafon' ('2006VLafo n_fc Led (|2006h . 
and .Richards et ah, (2009a). In Richard s e t al.l . we compare 
and contrast the use of diffusion maps with a more com- 
monly utilized linear technique, principal components anal- 
ysis, and demonstrate the superiority of diffusion maps in 
predicting spectroscopic redshifts of SDSS data from the 
galaxy spectra. 

Here, "spectral connectivity analysis" refers to a class of 
methods which utilize a local distance measure to "connect" 
similar observations. The eigenmodes (i.e., "spectral decom- 
position") of the rescaled matrix of similarities (see below 
for the definition of this matrix) can reveal a natural coor- 
dinate system for data that was absent in the original rep- 
resentation. For instance, imagine data in two dimensions 
th at to the eye clearly exhibit spiral structure (e.g., fig. 1 
of I Richards et aL I l2009al ). For such data, the Euclidean dis- 
tance between data points x and y would not be an optimal 
description of the 'true' distance between them along the 
spiral. Diffusion map is a leading example of an approach to 
SCA. In the diffusion map framework, the 'true' distance is 
estimated via a Active diffusion process over the data, with 
one proceeding from x to y via a random walk along the 
spiral. 

We construct diffusion maps as follows. 

We define a similarity measure s(x,y) that quantita- 
tively relates two data points x and y. In this work, a data 
'point' is a vector of colours {ci, . . . , Cp} of length p for a 
single galaxy, and the similarity measure that we apply is 
the Euclidean distance 



s{x,y) = 



A key feature of SCA is that the choice of s{x,y) is not 
crucial, as it is often simple to determine whether or not 
two data points are 'similar.' 

We remove extreme outliers from our dataset, not be- 
cause of their effect on diffusion map construction (a hall- 
mark of the diffusion map is its robustness in the presence of 
outliers), but rather because they can bias the coefficients 
of the linear regression model (see ^22} and because we 
find that individual predictions made for these objects are 
highly inaccurate. We compute the empirical distributions 
of Euclidean distances in colour space from each object to its 
n**^ nearest neighbor, where n€ [1,10]. These distributions 
are well-described as exponential, with estimated mean and 
standard deviation fin = (Jn = i„/log(2) for median value 
Xn- We exclude all data whose n"^ nearest neighbor is at a 
distance > /i„ -f- 5o"„ = 6(t„, for any value of n G [1, 10]. We 
find that ~ 80% of extreme outliers are removed with the 
first nearest-neighbor cut alone, with the fraction of those 
removed falling as n increases. 

With outliers removed, we construct a weighted graph 
where the nodes are the observed data points: 



w{x,y) = cxp - 



s{x,y) 



(1) 
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where e is a tuning parameter that should be small enough 
that w{x,y) ~ unless x and y axe similar, but large 
enough such that the graph is fully connected. (We discuss 
how we estimate e in 8^2.21 ') The probability of stepping from 
cc to y in one step is pi(a;, y) = w{x,y) / '^^■w(x, z). We 
store the one-step probabilities between all n data points in 
an n X n matrix P; then, by the theory of Markov chains, 
the probability of stepping from x to y int steps is given by 
the element pt{x,y) of the matrix P* . The diffusion distance 
between x and y at time t is defined as 

oc 

Dt{x,y) = ^Xf{ipjix)-ipj{y)f, 

where tpj and Xj represent eigenvectors and eigenvalues of P, 
respectively. By retaining the m eigenmodes corresponding 
to the m largest nontrivial eigenvalues and by introducing 
the diffusion map 



*t : a; 1-^ [Xlipiix), X\tp2{x),- ■ ■ , A™V™.(a^)] 



(2) 



from 



to 



we have that 



D?(x,y) ^ ^Af(V^,(x)-V.(y)f = ||*t(a;)-*,(y)lf 

i.e., the Euclidean distance in the m- dimensional embed- 
ding defined by equation [2] approximates diffusion distance. 
(We discuss how we estimate m in i]2.2l and show that, in 
this work, the choice of t is unimportant.) We stress that 
the diffusion map reparametrizes the data into a coordinate 
system that reflects the connectivity of the data, and does 
not necessarily affect dimension reduction. If the original 
parametrization in R'' is sufficiently complex, then it may 
be the case that m 3> p- 



2.2 Regression 



As in lRichards et all (|2009ah . we perform linear regression to 
predict the function z — r(^'t), where z is true redshift and 
^'t is a vector of diffusion coordinates in R™, representing 
a vector of photometric colours x in R'': 

m 

f(*t) = *3 = ^^,*t,,(x) 

J = l 

m m 

= J2^,X'^i^,{x) = ^^-1^,(0;) 

We see that the choice of the parameter t is unimportant, as 
changing it simply leads to a rescaling in /3j , with no change 
in P'j. We present relevant regression formulae in Appendix 

We determine optimal values of the tuning parame- 
ters (e,m) by minimizing estimates of the prediction risk, 
R{e, m) = E(L), where E(I/) is the expected value of a loss 
function L over all possible realizations of the data (one ex- 
ample of L is the so-called L2 loss functio n, which is simply 
the mean-squared error of the fit; see, e.g., IWassermanll2006l 
for a discussion of this and other topics introduced below). R 
quantifies the 'bias-variance' tradeoff; too much smoothing 
(m too low) yields prediction estimators with low variance 
and high bias, while too little smoothing (m too high) yields 
estimators with high variance and low bias. Using the full 



data set to estimate R underestimates the error and leads to 
a best-fit model with high bias, thus we apply 10-fold cross- 
validation (CV). The data are partitioned into 10 blocks of 
(approximately) equal size. We regress upon the data in nine 
of the blocks and use the best-fit regression model to pre- 
dict the responses % for the data in the tenth block. (We 
note that for algorithmic consistency we use the Nystrom 
extension to estimate the diffusion coordinates of the data 
in the tenth block; see ^2.31 ) The process is repeated 10 
times, for different block combinations, so that predictions 
are generated for each datum. The individual predictions are 
combined into an overall risk estimate 



Rcv{€,m) 



-E 



(3) 



where we apply the redshift-corrected rms dispersion as 
our loss function. Zi is the estimated spectroscopic red- 
shift for object i. (We capitalize to underscore the fact that 
the spectroscopic redshift is a random variable not neces- 
sarily equal to the true redshift Zi.) To ensure robustness, 
for each set of tuning parameters (e, m), we compute the 
mean Rcv(^,'m) of 10 estimates of Rcv(^,'ni), and select 
those values of (e,m) such that Rcvi^yfn) is minimized, 
i.e., (?, m) = arg min i?cv(e, »n). 



2.3 Diffusion coordinate estimation via the 
Nystrom extension 

The computation of diffusion coordinates (equation [2]) re- 
lies upon eigen-decomposition, which is computationally 
intractable for data sets of > 10^ galaxies. (However, see 
iBudavari et al.ll2009l . who propose an incremental methodol- 
ogy for computing eigenvectors.) Photometric datasets can, 
of course, be much larger, and thus we require a computa- 
tionally efficient scheme for estimating eigenvectors for new 
galaxies given those computed for a small set of galaxies 
used to train the regression model. A standard method in 
applied mathematics for 'extending' a set of eigenvectors is 
the Nystrom extension. 

The implementation is simple: determine the distance in 
colour space from each new galaxy to its nearest neighbors 
in the training set, then take a weighted average of those 
neighbors' eigenvectors. Let X represent the n x k matrix 
containing the colour data of the training set, where n and 
k are the number of objects and colours, respectively. Let 
X' represent a similar n' x k matrix containing colour data 
for n' objects in the validation set. The first step of the 
Nystrom extension is to compute the n' x n weight matrix 
W, with elements equivalent to those shown in equation [T] 
above (except that there, x and y are both members of the 
training set, while here, a; is a new point while y belongs to 
the training set). We assume the same value ? as was selected 
during diffusion map construction; since the training set is a 
random sample of galaxies from our original set, we expect 
the validation set to be sampled from the same underlying 
probability distribution. We row-normalize W by dividing 
by each element in row i' by p^/ = Wii^i. 

Let ^' be the n x m matrix of eigenvectors with cor- 
responding vector of eigenvalues A. To estimate the eigen- 
vectors for the new galaxies, we compute the n' x m matrix 
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*' = W*A, (4) 

where A is a m x m diagonal matrix with entries 1/ Ai . Then 
the redshift predictions for the n' objects are z — (3, 
where /3 are the linear regression coefficients generated for 
the original training set. 

3 APPLICATION TO SDSS AND DEEP2 
DATASETS 

3.1 SDSS spectroscopic data 

In this work, we use the Princeton/MIT reductions of SDSS 
spectroscopic dat£0- Features of these data include the so- 
called 'uber-c alibration' of uqriz magnit udes in six magni- 
tude systems l|Padmanabhan et al.ll200^ ). To facilitate a di- 
rect comparison of our results with those of lBall et ahlbOOSl . 
we utilize colours, i.e., differences between the magnitudes 
measured in different bands determined in each of four mag- 
nitude systems: psf, fiber, petrosian, and model. Thus the 
colour data occupy a p = 16 dimensional space. 

The necessary data are contained in the files 
spAll-<rel>.fits, where <rel> = EDR and DR1-DR6. 
We extract data from all publicly available plates for which 
PRDGNAME = 'main' and PLATEQUALITY = 'good,' keeping 
1001 plates in all. (We keep only one instance of each 
plate when repeated observations are made, making the ad 
hoc choice to retain the most recent observation.) For each 
plate, we examine data for those fibers for which CLASS = 
'GALAXY,' Z > 0.01, and ZWARNING = 0. For each of these 
fibers, we apply extinction corrections {A} (from column 
EXTINCTION) to the set of fl uxes |_F| and the set of esti- 
mated standard errors {sf} (|Finkbeiner et al.ll2004 ): 

F' = lO^-^^F 
Sf 

Spi — — . 

VlO-0'8^ 

If for any object, one or more elements of the set {F'} < 
0, we exclude the object from analysis. The flux units are 
nanomaggies; the conversion from F' to magnitude m' is 
m' — 22.5 — 2.51ogj^Q F' , while the conversion to colours is 
c;_, =2.51ogio(F;/i^/). 

The final number of galaxies in our sample is 417,224. 

3.1.1 Main sample galaxies 

From our data sample, we extract those 360,122 galaxies 
with P etrosian r-band m agnitude < 17.77 (or Fp^^^^ > 
77.983: [Strauss et al.ll200^ ). This is our main sample galaxy 
or MSG sample. We randomly select 10,000 galaxies from 
this sample to train our regression model. Application of 
the outlier-removal algorithm described in i32. 21 leads to the 
removal of 251 galaxies from this set. The application of 
the algorithm outlined in §§2.1-2 yields tuning parameter 
estimates (?, m) = (0.05,150), i.e., in order for a linear 
model to be appropriate, the 16-dimensional colour data is 
reparametrized into 150-dimensional space. 

As each object's eigenvector estimates are independent 

^ See |http : // spectro ■ pr Inceton . edu | 



Table 1. Parameters of optimal regression 



Dataset 


(?, fh) 


Rev 


V {%) 


n 


n-out 


MSG-T 


(0.05,150) 


0.0206 
(0.0231) 


0.010 


9,749 


251 


MSG-V 




0.0211 
(0.0240) 


0.018 


340,989 


9,384 


LRG-T 


(0.012,200) 


0.0189 
(0.0258) 


0.010 


9,734 


266 


LRG-V 




0.0195 
(0.0270) 


0.034 


20,082 


884 


DEEP2-T 


(0.002,850) 


0.0507 
(0.1063) 


1.67 


5,223 


304 


DEEP2-T 


(0.002,1050) 


0.0539 
(0.1123) 


2.14 


6,067 


351 



In the column 'Dataset,' T = training set and V = validation 
set. rj is the rate of catastrophic failures (i.e., the rate at which 
S > 0.15), n is the number of galaxies used in analysis after outlier 
removal, and riout is the number of 5cr outliers removed from 
sample. The number (outside/inside) the parantheses in column 
Rev (includes/does not include) normalization by (l+Z). M-band 
data are excluded from LRG analyses. For DEEP2-T, the first and 
second rows represent analyses of objects for which ZQUALITY = 
4 and ZQUALITY > 3, respectively. 

of those for other objects, we apply the Nystrom extension to 
validation set objects one plate at a time, then concatenate 
the resulting predictions. We determine which members of 
the validation set are 5a outliers relative to the members 
of the training set, and compute the value of Rev with 
those objects excluded. (Not excluding these outliers, which 
lie too far from the training set in colour space for their 
diffusion coordinates to be estimated accurately, results in 
Rev rising from « 0.02 to 0.56.) Out of 350,122 objects 
in the validation set, we exclude 9,133; the percentage of 
outliers is 2.61%. This is consistent with the 2.51% rate of 
outliers in the training set. 

We show our results in Table [1] and the top panel of 
Fig. [T] in which we display predictions for 10,000 randomly 
chosen objects of the validation set. The accuracy of pre- 
diction via the Nystrom extension versus directly fitting a 
linear regression model to the diffusion map coordinates of 
the data is indicated in Table [T] We find that Rev increases 
by 2.4% from 0.0206 to 0.0211, with catastrophic failure 
rate rj increasing but still small. (Here, a catastrophic fail- 
ure for object i is defi ned as Si > 0.15; see equation |3] and, 
e.g.. Illbert et al.ll2006l .) The small degradation in accuracy 
is more than balanced by computational speed; our naive 
implementation allowed extension to 350,373 galaxies in ~ 
10 CPU hours on a single GHz processor, a computation 
time that will be markedly reduced in future implementa- 
tions of the algorithm. Rev = 0.0211 (0.0240 without nor- 
malization hy 1 + Z) compares favora bly with a m yriad of 
other analyses of MSG data (see, e.g., iBall et al] . who ob- 
tain cr = 0.0207 without l + Z normalization, and references 
therein), and the empirical bivariate distribu tion of (z, Z ) is 
visu ally indistin guishable from those of, e.g.. iBall et al.l and 
Coll ister fc Laha v (2004). 

We determine estimator bias by binning the predictions 
2 as a function of Z, then in each bin computing 2 — Z, with 2 
being a 10% trimmed mean. See the top left panel of Fig. (2] 
It is readily apparent that there is a downward slope in the 
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Figure 1. Top: predictions for 10,000 randomly selected objects 
in the MSG validation set, for (?, m) = (0.05,150). Bottom: same 
as top, for the LRG validation set, with (e,rn) = (0.012,200). 
In both cases, we remove 5a outliers from the sample prior to 
plotting, thus the actual number of plotted points is 9,740 (top) 
and 9,579 (bottom). 



bias (i.e., redshifts are overestimated at low Z, and under- 
estimated at high Z). This is not caused by model bias (a 
bias that one would mitigate by adding complexity to the 
model, e.g., changing from linear to quadratic regression), 
but rather by attenuation bias, in which measurement error 
(i.e., uncertainty in the predictor, in this case the diffusion 
coordinates) re duces the slope of t he regression line (see Fig. 
\3\ see also, e.g.. lCarroll et a0200d ). To demonstrate that our 
data are affected by attenuation bias, we perform a simple 
experiment. First, we take the MSG training set fluxes and 
resample them according to the prescription given in Ap- 
pendixlB] This increases all measurement errors. (To see this 
intuitively, imagine sampling random variables X ~ A^(0, 1), 
i.e., each value of X is sampled from a Gaussian distribu- 
tion with mean and variance 1. Then resample from the 
observed values X: Y ~ N{X, 1). The standard deviation of 
the resulting sample is now \/2, i.e., the error has been ar- 
tificially increased by resampling.) Then we resample fluxes 
for 1,000 randomly selected validation set objects. By doing 
each resampling (training set and validation set) 25 times, 
we build up a set of 625 predictions of for each of the 
1,000 selected objects. Following the same prescription as 
above, we estimate the bias; the top panel of Fig. |4] shows 
how for the MSG dataset, increasing the measurement error 
via resampling leads to a steepening of the bias slope, i.e., 
the effect of attenuation bias is magnified. 

There exist methods for correcting the bias in lin- 
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Figure 2. Top Left: estimated bias z — Z for MSG redshift es- 
timates z, computed in bins of width AZ = 0.01 in the range 
Z g [0.01,0.25]. Top Right: estimated standard deviation for 
MSG redshift estimates (normalized by I + Z). Middle Left and 
Right: same as top left and right, except for LRG redshift esti- 
mates in bins of width AZ = 0.02 in the range Z S [0.20, 0.44]. 
Bottom Left and Right: same as top left and right, except for 
DEEP2 redshift estimates (ZQUALITY = 4) in bins of width AZ = 
0.05 in the range Z 6 [0.0, 1.5]. 
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Figure 3. Simple demonstration of the effect of attenuation bias 
on linear regression. Left: example of linear regression fit to data 
with no measurement error in the predictor and with response 
Y ~ N{x,O.OA), where x = {1,2,3,4}, i.e., each value of Y is 
sampled from a Gaussian distribution with mean x and variance 
0.04. The black dots indicate the observed data, while the open 
circles show the true (x, y) values. Right: same as left, but with 
measurement error applied to the predictor: X ~ N{x, 1). The 
effect of this measurement error is to reduce the slope of the 
regression line, on average. The mean reduction in slope for this 
toy example is 0.25 (from 1 to 0.75), as estimated via 10,000 
simulations. 
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Figure 4. Top: change in the estimated bias z — Z induced by 
resampling MSG training and validation set fluxes and refitting. 
Because resampling increases the measurement error (i.e., the er- 
ror in the predictor, in this case the diffusion coordinates), the 
slope of the regression line is reduced, increasing overestimates of 
2 at low Z and underestimates of z at high Z. Bottom: same as 
top, for LRG datasets. 



ear regression coefficient estimation caused by addi- 
tive, heteroscedastic (i.e., non-constant) measurement er- 
rors of known magnitude that are bas ed on the SIMEX, 
or si mulation -extrapolation, alg orithm (|Cook fc Stefanskil 
1 19941 : see, e.g., ICarroll et al]|2006| and references therein). In- 
deed, one of the advantages to our approach is that the non- 
linearity is in the reparametrization, not the fitted model. 
Hence, available methods for correcting for measurement er- 
ror could be utilized. We are currently exploring the imple- 
mentation of SIMEX-based methods in a computationally ef- 
ficient manner, and we will present our results in a future 
publication. 

While attenuation bias is caused by measurement er- 
ror, its magnitude is affected by the distribution of the 
predictors, i.e., the design. Expressions relating the design 
to the bias magnitude are highly problem dependent. In 
the simplest, one-dimensional example of attenuation bias, 
the predictors are assumed to be normally distributed- 
X ~ N {^.x , cr^)-a,nA the effect on the slope j3i is to reduce 
its value: /3i — > A/3i, where A — a1/{cr1 + cr^) and (Ju is the 
measurement error. The smaller the value of a^, the greater 
the effect upon the bias. We mention this explicitly to un- 
derscore that analyzing samples for which the predictors are, 
e.g., uniformly distributed may reduce the magnitude of the 
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Figure 5. Left: estimated bias z — Z for MSG redshift esti- 
mates S", computed in bins of width AZ = 0.01 in the range 
Z a [0.01,0.25], for a 10,000-galaxy sample constructed so as to 
be uniform in Z. Uniformity in Z reduces the bias slightly (cf. the 
top panel of Fig. [2} . This result indicates that measurement er- 
ror is the dominant cause of the bias. Right: estimated standard 
deviation for MSG redshift estimates (normalized by 1 + Z). 

bias magnitude but will not eliminate it since measurement 
error is still present. In Fig. O we show the estimated sam- 
ple bias as a function of Z for a 10,000-galaxy sample con- 
structed so as to be uniform in Z (though the distribution 
of the predictors themselves-the diffusion coordinates-is not 
necessarily uniform). Comparing these results with the top 
panels of Fig. [2] we find that uniformity in Z reduces the bias 
slightly (while also slightly increasing sample standard de- 
viation) . This indicates that measurement error is the dom- 
inant cause of the observed bias. 

Nonparametric estimators such as k-nearest neighbor 
(kNN) and local polynomial regression are also affected by 
measurement error bias (whose mitigation is dubbed the 
"deconvolution problem") and design bia s, and in addition 
by boundary bia s (see, e.g., chapter 5 of ' WassermanI I2OO6I 
and chapter 12 of lCarroll et al. 2006 and references therein). 
Thus t he similarit y of our bivariate distribution to that 
of, e.g., iBall et al] (See their fig. 6. In this figure, we note 
slightly larger deviations from the z = Z locus at the end- 
points than our bivariate distribution exhibits, which may 
indicate boundary b ias but also could be a result of the 
fact that iBall et al] do not minimize risk and thus could 
be adopting a solution with relatively higher bias and lower 
variance than our solution.) 

In addition to estimator bias, we also examine the es- 
timator variance, i.e., the width of the observed bivariate 
distribution (given as a function of Z in the right column 
of Fig. [2I1 . Contributing to the variance is (a) model uncer- 
tainty, i.e., the standard deviation of the estimates z (given 
by the square root of the diagonal elements of the matrix 
given in equation IA3|I ; (b) uncertainty in the flux for each 
object; and (c) intrinsic scatter, i.e., the fact that the MSG 
sample does not necessarily contain a homogeneous set of 
objects. Model uncertainty contributes little to the observed 
scatter; the mean, median, and standard deviation of the 
model uncertainties are < 10~^. Flux uncertainty enters via 
attenuation bias; as flux errors increase, the linear regres- 
sion slope flattens and acts to decrease the sample variance 
within a redshift bin. However, in our simple attenuation- 
bias demonstration we observe only negligible changes in the 
sample variance. Thus we conclude that the observed sample 
variance is primarily due to intrinsic sca tter, and can only 
be reduced by introducing more data (cf. Illbert et al.l 120081 , 
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who achieve Rev ^ 0.01 by utihzing data from 30 bands in 
the UV, optical, and IR regimes). 

3.1.2 Luminous red galaxies 

From our data sample, we extract those 30,700 galaxies for 
which Z > 0.2 and P RIMTARGET = 32 (TARGETGALAXYRED; 
lEisenstein et al.||200lf ). This is our luminous red galaxy or 
LRG sample. As with the MSG training set, we randomly 
select 10,000 galaxies and then remove outliers. Because 
the It band data of high-redshift LRGs lacks constraining 
power (as LRGs are faint in u and thus the magnitudes 
are noisy), we use only griz fluxes in analyses (so that p 
— 12). The training set contains 9,734 objects. Application 
of the algorithm outlined in §§2.1-2.2 yields tuning param- 
eter estimates (?, m) — (0.012,200). The results of fitting 
are shown in Table [1] and the bottom panel of Fig. [1] As 
in the case of the MSG analysis, our value Rev ~ 0.0195 
(0.0270 wi thout 1 + Z no rmalization) compares favorably 
with, e.g., HaUeFaLlllooi, who achieve a = 0.0242 (with- 
out 1 + Z normalization), and references therein. We find 
that the outlier rate is consistent from training set to val- 
idation set (increasing from 2.7% to 3.1%), and that Rev 
increases by only 3.1% when we use the Nystrom extension 
as opposed to directly fitting the data. (Note that if we in- 
clude the u band, the estimate of ? increases by two orders 
of magnitude, indicating the scatter in colour space intro- 
duced by non-constraining ii-band data, although Rev it- 
self only rises by « 5%.) The LRG redshift predictions, like 
their MSG counterparts, are biased, with a similar down- 
ward trend in the bias as a function of Z (left middle panel. 
Fig. [2} . We repeat our simple resampling experiment with 
LRG data and find that the bias slope increases upon re- 
sampling, demonstrating that attenuation bias also affects 
LRG data analysis (as expected; see Fig. |4]). 

3.2 DEEP2/CFHTLS data 

The DEEP2 Ga laxy Redshift Survey (jPavis et all |2003| . 
iDavis et al.ll2007l ') studied both galaxy properties and large- 
scale structure primarily at redshifts 0.7 < z < 1.4, in four 
fields of total area ~ 3 square degrees. DEEP2 targets are 
select ed to have Rab ^ 24.1 using CFHT BRI photometric 
data (|Coil et all 120041 '). In three of the four DEEP2 fields, 
colour cuts are used to select z > 0.7 objects for observa- 
tion; however, in this paper we utilize the DEEP2 sample 
in the Extended Groth Strip, for which no colour cuts have 
been applied. DEEP2 collected spectra typically covering 
the wavelength range 6,500-9,100 A for > 50,000 objects. 
From the survey we select the 6,552 galaxies for which single- 
system ugriz photometry exists from the CFHT Legacy Sur- 
vey (field 03)0 and for which the DEEP2 ZQUALITY fiag is 
either 3 or 4 (> 95% or 99.5% confidence that the redshift 
is correct, respectively). Thus the dimensionality of colour- 
space for these data is p = 4. We further remove data for 
which the redshift error, or any magnitude or magnitude 
error, is not provided, leaving 6,418 galaxies; after outlier 

^ See |http : //wm«4 . cadc-ccda . hla-lha . nr c-cnrc ■ gc ■ ca/] 
communi ty/CFHTLS-SG/docs/cfhtls.html and lGwvnl ll2008l) . 
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Figure 6. Top: predictions for the 6,067 objects in the DEEP2 
training set for which ZQUALITY > 3. For these data, (e", fh) = 
(0.002,1050) and Rev = 0.0539. Bottom: same as top, for the 
5,223 objects for which ZQUALITY = 4; (?,m) = (0.002,850) and 
Rev = 0.0507. 



removal, the final sample size is 6,067. If we restrict our- 
selves to data for which ZQUALITY = 4, the sample size is 
5,223. 

Application of the algorithm outlined in §§2.1-2.2 
yields tuning parameter estimates (?, fh) — (0.002,850) for 
ZqUALITY = 4 and (0.002,1050) for ZQUALITY ^ 3. We dis- 
play our results in Table [l] and Fig. [6l note that because 
we do not apply the Nystrom extension here (but rather, 
fit to the data directly after (e,m) are determined), the ob- 
served scatter is smaller than we would observe with a larger, 
Nystrom-extended dataset. In both cases, we exclude 5.8% 
of the objects from analysis as outliers. 

In Fig. [6l we observe that the quality of the fits be- 
low Z ^ 0.75 (.Rev = 0.038 for ZQUALITY = 4) is superior 
to that at higher redshifts {Rev = 0.064). To understand 
why this is so, we examine the DEEP2 colour data (Fig. 
[7]). Pick an object at Z ^ 0.75, and compute the Euclidean 
distance in colour-space to a random object at any other 
redshift Z G [0, 1.5]. This distance is a nearly constant func- 
tion of AZ; thus for values of e similar to those chosen in 
the SDSS analyses, there is only a slightly lesser probability 
of diffusing from Z — 0.75 to, e.g., Z = 0.2 as to, e.g., Z — 
0.74. To achieve accurate predictions at Z ~ 0.75, e must be 
made smaller (lessening the probability of large AZ jumps); 
this is what our optimization yields. A consequence of a 
smaller 'e is that the weighted graph of the DEEP2 objects 
is not fully connected (see discussion around equation [1]) . 
One can discern connectedness by examining the vector of 
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Figure 7. Observed ugriz colours for the 5,223 objects in the 
DEEP2 training set for which ZQUALITY = 4. 
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Figure 8. Predictions for the 5,223 objects in the DEEP2 train- 
ing set for which ZQUALITY = 4, for e = 0.002 and m = 40 (top 
left), 100 (top right), 400 (bottom left), and 850 (m; bottom 
right). 



eigenvalues; for 0.002, the first « 20 eigenvalues are all > 
0.95, implying the presence of several disconnected clumps 
on the graph. The most visually obvious manifestation of 
disconnectedness in the DEEP2 analysis is the presence of a 
marked knee in the predictions at Z ^ 0.75 for small values 
of m (see Fig. [8)l; the dominant eigenvectors describe the 
low redshift data well, but not the high redshift data. As m 
increases, the knee straightens out; however, because of the 
bias-variance tradeoff, m can only increase so much before 



Rev begins to increase as well, due to increasing variance. 
For m = 850 (ZQUALITY = 4) we have not yet achieved an op- 
timal description for the high-redshift data. To demonstrate 
that we can achieve a better description of these data, we 
split the full dataset into low- and high-redshift sets (at, e.g., 
Zcut = 0.9) and compute diffusion maps for each. We find 
that we can achieve, e.g., Rev ~ 0.035 for high-redshift data 
with as few as 40 eigenvectors, while the predictions at low 
redshifts change only slightly. While splitting the data yields 
better results for our DEEP2 sample, we do not propose such 
splitting as part of our general diffusion map framework, for 
multiple reasons: (a) it adds a tuning parameter (Zcut), (b) 
it complicates the Nystrom extension (to which data split do 
we assign a new object?), and most importantly (c) a data 
split can be rendered moot with the inclusion of new data in 
other bandpasses (e.g., the inclusion of near-IR data in the 
DEEP2 sample would mitigate the Euclidean-distance issue 
seen at Z Ri 0.75). 

Concentrating on the regime Z < 0.75, we find that 
our result Rev ~ 0.035 with r i « 1.1% compares favorably 
with that of lllbert et al. !('2006'), who train a template-based 
photometric redshift code using 2,867 spectroscopic redshifts 
from the VIMOS VLT Deep Survey (VVDS) in the CFHTLS 
Dl field and obtain a — 0.032 and -q = 4% (see their §6.3 and 
fig. 14). Our smaller catastrophic failure rate is presumably 
largely due to our re moval of colo ur-space outliers prior to 
analysis. We note that lllbert et al.l perform a similar analysis 
with CFHTLS z-band data removed, with the result that a 
marked knee appears at Z « 0.8 that is similar to what we 
observe in analyzing our intrinsically bluer DEEP2 sample. 
This supports the hypothesis that adding data from other 
bandpasses to our DEEP2 sample will lead to a marked 
improvement in fit at redshifts Z > 1. 



4 SUMMARY AND FUTURE DIRECTIONS 

In this paper we apply an eigenmode-based framework utiliz- 
ing the diffusion map and linear regression to the problem 
of estimating redshifts given SDSS and DEEP2/CFHTLS 
ugriz photometry. Because estimating diffusion map coor- 
dinates via eigen-decomposition limits the size of training 
sets to ~ 10* objects, we implement the Nystrom exten- 
sion, which allows for computationally efficient estimation 
of diffusion coordinates with a relatively small degradation 
of accuracy. 

For our SDSS MSG sample, we train our linear regres- 
sion model on 9,749 randomly selected objects and via the 
Nystrom extension estimate redshifts for another 340,989 
galaxies. Since the Nystrom extension is not robust to ex- 
treme outliers, we use a nearest-neighbor algorithm to elim- 
inate 5cr outliers in colour space; this eliminates « 2.5% of 
the MSG sample. The loss in accuracy resulting from use 
of the Nystrom extension is ~ 2.4% (as compared with di- 
rectly fitting the data of the training set). For our SDSS 
LRG sample, we train our regression model on 9,734 ob- 
jects and via the Nystrom extension estimate redshifts for 
another 20,082, with an outlier rate ~ 3% and a degrada- 
tion of accuracy ^ 3%. As the DEEP2/CFHTLS sample has 
only « 6,000 objects (with an outlier rate of ~ 5.8%), we 
do not define a validation set to check the accuracy of pre- 
dictions generated via the Nystrom extension. However, we 
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will apply our regression model to a test set comprised of 
all galaxies in CFHTLS fields D1-D4 and make that catalog 
publicly available. 

The observed bivariate distributions {z, Z) for our 
SDSS datasets are sim ilar to those computed by, e.g., 
ICollister fc Lahavl (|2004l ) usi ng AMHz ( s pecific ally, for the 
SDSS MSG dataset) and by iBall et al.l using a nu- 

merically intensive nearest-neighbor algorithm (for both the 
SDSS MSG and LRG datasets), with dispersion on par with 
those techniques {Rev ~ 0.02; see lBall et al ] |2008l and ref- 
erences therein). These distributions indicate that redshifts 
are generally overestimated at low Z and underestimated at 
high Z. We demonstrate that this is a manifestation of at- 
tenuation bias, wherein measurement error (uncertainty in 
the diffusion coordinates resulting from uncertainty in the 
SDSS flux estimates) reduces the measured slope of the re- 
gression line. In statistical parlance, the measured slope is 
not a consistent estimator of the true slope. In order to use 
photometric redshift estimates in precision cosmology, it is 
vital that methods for producing consistent estimates (i.e., 
mitigating the bias) be developed and implemented. We are 
exploring usi ng the SIMEX, or sim ulation-extrapolation, al- 
gorithm (e.g.. lCarroll et al.ll2006l ) to produce consistent es- 
timates in a computationally efficient manner, and we will 
present our results in a future publication. 

For the DEEP2 data, the dominant feature in the ob- 
served bivariate distribution, beyond attenuation bias, is a 
marked reduction in prediction accuracy at redshifts Z > 
0.75. We demonstrate that this is due to a degeneracy in 
the colour-space manifold that would be mitigated with the 
introduction of more data from other bandpasses. We note 
that we also can mitigate the effects of the degeneracy by 
splitting the training set into low- and high-Z samples, but 
we do not prefer this approach because of the complexity 
it adds to the prediction algorithm (through the addition 
of a tuning parameter Zcut and the necessity of providing a 
quantitative measure for robustly choosing between the two 
predictions we would generate for each test object). At lower 
redshifts, we find that the observed bivariate distribution 
(2", Z) compares favorably with that derived bv lllbert et al.l 
(|2006 ') {Rev ~ 0.035 versus a = 0.032). 

Our current statistical framework yields a single photo- 
metric redshift estimate for each object in the validation set, 
as opposed to a p robability distrib ution function (PDF) for 
each estimate (cf. lBaU et al]l2008h . This is a valid approach 
for analyzing, at the very least, the gal axies of th e SDSS 
sample that we consider in this work, as iBaU et al.l demon- 
strate that the PDFs in the low-redshift regime are approxi- 
mately normal; we expect our single estimates to match the 
PDF means. However, we would have to alter our framework 
if we were to analyze quasars, for which the PDFs are often 
bimodal (e.g., fig. 5 of Bal l et al .). Bimodality is an indica- 
tion of (near-) degeneracy in the colour-space manifold; when 
its colours are perturbed, a quasar's nearest neighbor some- 
times belongs to one range of redshifts, and sometimes to a 
completely different range. Within our current framework, 
such a degeneracy would not affect the computation of the 
diffusion map, but the subsequent application of linear re- 
gression would yield inaccurate redshift estimates for those 
quasars in the vicinity of the degeneracy. For quasar analy- 
sis, we would explore a variety of options, which include (a) 



utilizing a different form of regression, (b) incorporating the 
response variables in to the construction of the diffusion map 
(jCosta fc Hero 2005), and/or (c) incorporating gradient in- 
formation into difi'usiori map construction, such that nearby 
objects that lie along the manifold have higher similarity 
measures. Such schemes would mitigate but not entirely lift 
the degeneracy and thus we would also have to quantify the 
relative probabilities of dual estimates. 

In this work, we demonstrate the efficacy of SCA, in par- 
ticular our diffusion map framework, for analyzing datasets 
for which the spectroscopic redshifts are known. The next 
step is to extend our framework such that it yields accu- 
rate photometric redshift estimates for objects in datasets 
where the spectroscopic coverage will be minimal, such as 
deep sky surveys (e.g., LSST) or pointed surveys beyond 
Z « 1. Even with long exposure times, the DEEP2 Galaxy 
Redshift Survey is only able to determine secure redshifts 
for ~ 70% of its objects, with about half the missed tar- 
gets being star-forming galaxies a,i Z > 1.4 that have 
no features in DEEP2 spectral window; cf. ICooper et al.l 
(i20(16). Even when spectroscopic redshifts are available for 
a significant subset of these objects, it is likely that they 
will be gleaned from intrinsically luminous objects whose 
SEDs may not closely match those for fainter objects. Thus 
it becom es imperative to fo l d add i tional information into 
analyses. [C oUistcr fc Lahavl (|2004l '). iBall et al.l Hooi), and 
IWrav fc Gu nn (200^" propose using structural properties 
such as surface brightness and angular radius to obtain more 
accurate redshift estim ates; however, t his is of limited util- 
ity at higher redshifts. iNewmanI ()2008h proposes that pho- 
tometric redshifts can be calibrated using their correlations 
on the sky with objects of known redshift, as a function of 
that known redshift. A related idea would be to take into 
account the redshifts of nearb y objects on the sk y in esti- 
mating photometric redshifts (|Kovac et al.ll2009l ): because 
of the clustering of galaxies, there is a significant probabil- 
ity that two galaxies near each other on the sky are at very 
similar redshifts. 

In a future work, we will fold additional quantities into 
our similarity measure and will determine if photometric 
redshift can be estimated with sufficient accuracy so as to 
fulfill their promise as a cosmological probe. 
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APPENDIX A: RELEVANT FORMULAE FOR 
WEIGHTED LINEAR REGRESSION 

Let X represent a matrix of predictors (in this work, the 
matrix of diffusion coordinates ^, where each row repre- 
sents the coordinates for a single object), let Y represent 
the vector of responses (the estimated spectroscopic red- 
shift values), and let S represent the covariance matrix for 
y, which we assume to be diagonal: 
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Then the best-fit coefficients are 
/3 = AF 
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the variance-covariance matrix for /3 is 
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= V(AF) 
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and the variance-covariance matrix for Y — X/3 is 

v(y) = v(x/3) 

= XV(/3)X^ 
= xfx^E"^X 



(A2) 
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APPENDIX B: RESAMPLING SDSS FLUX 
MEASUREMENTS 

We assume each flux is a normal deviate with error esti- 
mated by the Princeton/MIT data reduction pipeline. How- 
ever, fluxes in, e.g., different SDSS magnitude bands and 
systems are correlated random variables. In order to resam- 
ple fluxes accurately, we must take these correlations into 
account. For each object in the validation set, we have 20 
flux measurements F and estimates of flux standard error 
sf- The covariance matrix S is defined as 

^ 1 P1,2SFiSF2 ■■■ Pi,20SFiSF2o \ 

Pi,2SfiSf2 1 ••■ '■ 



V PI 



Pi,20SFiSF2i 



where pij is the sample correlation coefficient between mea- 
surements i and j (e.g., between the PSF it-band and 
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the Petrosian r-band). We estimate pi,j using Pearson's 
product-moment correlation estimator 




where s is the sample standard deviation. As expected, 
we find that fluxes measured via different systems within 
a single magnitude band are strongly positively correlated 
(p > 0.5); also, we find that fluxes across bands have non- 
negligible positive correlations, which we attribute to the 
relative homogeneity of the MSG sample (whose objects lie 
at relatively similar distances and display relatively similar 
physical characteristics). However, so as not to impose this 
homogeneity in resampling, we set pi^j = if indices i and 
j represent different magnitude bands. 

Wc use the Cholcsky method to decompose S into 
lower- and upper- triangular matrices A and A^. Then we 
can compute a new vector of fluxes: 

F'i=Fi + l^z, 

where z is a vector of standard normal deviates. 
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